Weekly Work Summary

The bulk of my work this week went to developing a visualization method for examining tweets over time and to implementing sentiment analysis directly in Elasticsearch. For the visualization, I committed an R function, plot_tweet_sentiment_timeseries.R, to /COVID-Twitter/analysis on GitHub from my branch hacl-campoh; it takes a dataframe of tweets and returns a ggplot object visualizing sentiment over time (shown later in this notebook). I am currently working with Abraham and Rachel on the backend to attach sentiment information to the tweets in Elasticsearch, which should greatly shorten the time we spend analyzing sentiment-related data.

Our plan is to first use the vaderSentiment Python implementation of VADER to assign a sentiment score to each tweet in the database, and then use Elasticsearch's aggregation methods to quickly retrieve tweet sentiment data over time.
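Once sentiment scores are attached to each tweet, a single date_histogram aggregation with an avg sub-aggregation should return average sentiment per time bucket without pulling the tweets themselves. Below is a minimal sketch of such a request body; the field names ("created_at", "sentiment") and the weekly interval are assumptions for illustration, not the final mapping.

```python
# Sketch of an Elasticsearch request body that buckets tweets by time and
# averages a per-tweet sentiment field. Field names and the interval are
# illustrative assumptions.
def sentiment_over_time_query(interval="week"):
    """Build an aggregation-only query for average sentiment per bucket."""
    return {
        "size": 0,  # we only want the aggregation buckets, not the hits
        "aggs": {
            "tweets_over_time": {
                "date_histogram": {
                    "field": "created_at",
                    "calendar_interval": interval,
                },
                "aggs": {
                    "avg_sentiment": {"avg": {"field": "sentiment"}}
                },
            }
        },
    }
```

Because the aggregation runs server-side, the client receives only one bucket per week instead of thousands of raw tweets, which is where the speedup over scoring in R comes from.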

Furthermore, we plan to upgrade from VADER to a pretrained BERT model in the future, replacing its final layer so that it produces continuous sentiment scores on the same scale as VADER.
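The idea behind the last-layer change can be sketched without any deep-learning framework: instead of a softmax classification head, the pooled representation is passed through a single linear layer and squashed with tanh, so scores land in (-1, 1), the same range as VADER's compound score. The vector, weights, and bias below are hypothetical stand-ins for learned parameters.

```python
import math

# Sketch of a regression head replacing BERT's classification head.
# "pooled" stands in for the pooled [CLS] representation; weights/bias
# are hypothetical learned parameters.
def regression_head(pooled, weights, bias):
    # single linear layer: dot product plus bias
    score = sum(p * w for p, w in zip(pooled, weights)) + bias
    # tanh bounds the output to (-1, 1), matching VADER's compound scale
    return math.tanh(score)
```

In practice this head would be fine-tuned on sentiment-labeled data so its outputs are comparable to the VADER scores already stored in the index.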

GitHub Branch & Commits

  • Branch Name: hacl-campoh
  • Commit #16 to /COVID-Twitter/analysis: R function for visualizing tweet sentiment trends over time
    • This is the definition of plot_tweet_sentiment_timeseries.R

Shared Code

Personal Contribution

Discussion of Primary Findings

When writing plot_tweet_sentiment_timeseries.R, I was interested in what kinds of sentiment trends the different clusters displayed over time. To demonstrate its behavior, we apply it to a sample from coronavirus-data-masks and its clusters as determined by k-means: 10,000 random tweets from the coronavirus-data-masks dataset with dates ranging from January 1st, 2020, to August 1st, 2020.

###############################################################################
# Get the tweets from Elasticsearch using the search parameters defined above
###############################################################################

results <- do_search(indexname="coronavirus-data-masks", 
                     rangestart=rangestart,
                     rangeend=rangeend,
                     text_filter=text_filter,
                     semantic_phrase=semantic_phrase,
                     must_have_embedding=TRUE,
                     random_sample=random_sample,
                     resultsize=resultsize,
                     resultfields='"created_at", "user.screen_name", "user.verified", "user.location", "place.full_name", "place.country", "text", "full_text", "extended_tweet.full_text", "embedding.use_large.primary", "dataset_file", "dataset_entry.annotation.part1.Response", "dataset_entry.annotation.part2-opinion.Response"',
                     elasticsearch_host="lp01.idea.rpi.edu",
                     elasticsearch_path="elasticsearch",
                     elasticsearch_port=443,
                     elasticsearch_schema="https")

# this dataframe contains the tweet text and other metadata
tweet.vectors.df <- results$df[,c("full_text", "user_screen_name", "user_verified", "user_location", "place.country", "place.full_name", "created_at")]

# this matrix contains the embedding vectors for every tweet in tweet.vectors.df
tweet.vectors.matrix <- t(simplify2array(results$df[,"embedding.use_large.primary"]))

###############################################################################
# Clean the tweet and user location text, and set up tweet.vectors.df 
# the way we want it by consolidating the location field and computing
# location type
###############################################################################

tweet.vectors.df$user_location <- ifelse(is.na(tweet.vectors.df$place.full_name), tweet.vectors.df$user_location, paste(tweet.vectors.df$place.full_name, tweet.vectors.df$place.country, sep=", "))
tweet.vectors.df$user_location[is.na(tweet.vectors.df$user_location)] <- ""
tweet.vectors.df$user_location_type <- ifelse(is.na(tweet.vectors.df$place.full_name), "User", "Place")

clean_text <- function(text, for_freq=FALSE) {
  text <- str_replace_all(text, "http\\S+", "")
  text <- str_replace_all(text, "[\\s]+", " ")
  if (isTRUE(for_freq)) {
    text <- tolower(text)
    text <- str_replace_all(text, "’", "'")
    text <- str_replace_all(text, "_", "-")
    text <- str_replace_all(text, "[^a-z0-9 ']", "")  # 0-9, not 1-9, so zeros survive
  } else {
    # hyphen moved to the end of the class so it is a literal, not a range
    text <- str_replace_all(text, "[^a-zA-Z0-9 `~!@#$%^&*()_=+\\[\\];:'\",./?’-]", "")
  }
  text <- str_replace_all(text, " +", " ")
  trimws(text)
}
tweet.vectors.df$full_text <- sapply(tweet.vectors.df$full_text, clean_text)
tweet.vectors.df$user_location <- sapply(tweet.vectors.df$user_location, clean_text)

To determine a good number of clusters, we use the “elbow method” in the absence of a more automated technique, subjectively selecting k = 17.

wssplot <- function(data, fc=1, nc=40, seed=20){
  wss <- data.frame(k=fc:nc, withinss=c(0))
  for (i in fc:nc){
    set.seed(seed)
    wss[i-fc+1,2] <- sum(kmeans(data, centers=i, iter.max=30)$withinss)}

  ggplot(data=wss, aes(x=k, y=withinss)) +
    geom_line() +
    ggtitle("Quality (within sums of squares) of k-means by choice of k")
}
wssplot(tweet.vectors.matrix)

Next, we add a new column to the dataframe of tweets containing each tweet's VADER compound sentiment score, to speed up the visualization function’s running time. Note that this was written before Abraham’s implementation of VADER sentiment directly in Elasticsearch, which immensely speeds up this part of the process.

####################################################
# Compute and attach tweet sentiment to each tweet
####################################################

sentiment.vector <- rep(NA_real_, nrow(tweet.vectors.df))
for (i in seq_len(nrow(tweet.vectors.df))) {
  tryCatch({
      # get_vader returns a named vector; "compound" is the overall score
      sentiment.vector[i] <- as.numeric(get_vader(tweet.vectors.df$full_text[i])["compound"])
    }, error = function(e) {
      # leave sentiment.vector[i] as NA: assigning inside the error handler
      # would only modify a local copy, not the outer vector
    })
}
#sentiment.vector <- vader_df(tweet.vectors.df$full_text)[,"compound"]
tweet.vectors.df$sentiment <- sentiment.vector
tweet.vectors.df <- tweet.vectors.df[!is.na(sentiment.vector),]
tweet.vectors.matrix <- tweet.vectors.matrix[!is.na(sentiment.vector),]

We then cluster the tweet embeddings using k-means. For the purposes of demonstration, we do not perform subclustering: in the absence of a technique for automatically identifying topics, the topics contained in the high-level clusters are easier to interpret from a word-frequency point of view than those in the sparser subclusters.

###############################################################################
# Run K-means on all the tweet embedding vectors
###############################################################################

# Number of clusters
k <- 17 

set.seed(300)
km <- kmeans(tweet.vectors.matrix, centers=k, iter.max=30)

tweet.vectors.df$vector_type <- factor("tweet", levels=c("tweet", "cluster_center", "subcluster_center"))
tweet.vectors.df$cluster <- as.factor(km$cluster)

#append cluster centers to dataset for visualization
centers.df <- data.frame(full_text=paste("Cluster (", rownames(km$centers), ") Center", sep=""),
                         user_screen_name="[N/A]",
                         user_verified="[N/A]",
                         user_location="[N/A]",
                         user_location_type = "[N/A]",
                         place.country = "[N/A]",
                         place.full_name = "[N/A]",
                         created_at = "[N/A]",
                         vector_type = "cluster_center",
                         cluster=as.factor(rownames(km$centers)),
                         sentiment=NA)
tweet.vectors.df <- rbind(tweet.vectors.df, centers.df)
tweet.vectors.matrix <- rbind(tweet.vectors.matrix, km$centers)

Next, we find the most common words in each cluster to make the clusters easier to interpret. Ideally, this would be replaced by a more automatic method in the future.

###############################################################################
# Compute labels for each cluster based on word frequency
# and identify the nearest neighbors to each cluster center
###############################################################################

stop_words <- stopwords("en", source="snowball")
stop_words <- union(stop_words, stopwords("en", source="nltk"))
stop_words <- union(stop_words, stopwords("en", source="smart"))
stop_words <- union(stop_words, stopwords("en", source="marimo"))
stop_words <- union(stop_words, c(",", ".", "!", "-", "?", "&amp;", "amp"))

get_word_freqs <- function(full_text) {
  word_freqs <- table(unlist(strsplit(clean_text(full_text, TRUE), " ")))
  word_freqs <- cbind.data.frame(names(word_freqs), as.integer(word_freqs))
  colnames(word_freqs) <- c("word", "count")
  word_freqs <- word_freqs[!(word_freqs$word %in% stop_words),]
  word_freqs <- word_freqs[order(word_freqs$count, decreasing=TRUE),]
}

get_label <- function(word_freqs, exclude_from_labels=NULL, top_k=3) {
  words <- as.character(word_freqs$word)
  exclude_words <- NULL
  if (!is.null(exclude_from_labels)) {
    exclude_words <- unique(unlist(lapply(strsplit(exclude_from_labels, "/"), trimws)))
  }
  label <- paste(setdiff(words, exclude_words)[1:top_k], collapse=" / ")
}

get_nearest_center <- function(df, mtx, center) {
  df$center_cosine_similarity <- apply(mtx, 1, function(v) (v %*% center)/(norm(v, type="2")*norm(center, type="2")))
  nearest_center <- df[order(df$center_cosine_similarity, decreasing=TRUE),]
  nearest_center <- nearest_center[nearest_center$vector_type=="tweet", c("center_cosine_similarity", "full_text", "user_location")]
}

master.word_freqs <- get_word_freqs(tweet.vectors.df$full_text)
master.label <- get_label(master.word_freqs, top_k=6)

clusters <- list()
for (i in 1:k) {
  cluster.df <- tweet.vectors.df[tweet.vectors.df$cluster == i,]
  cluster.matrix <- tweet.vectors.matrix[tweet.vectors.df$cluster == i,]
    
  cluster.word_freqs <- get_word_freqs(cluster.df$full_text)
  cluster.label <- get_label(cluster.word_freqs, master.label)
  cluster.center <- cluster.matrix[cluster.df$vector_type=="cluster_center",]
  cluster.nearest_center <- get_nearest_center(cluster.df, cluster.matrix, cluster.center)
 
  
  clusters[[i]] <- list(word_freqs=cluster.word_freqs, label=cluster.label, nearest_center=cluster.nearest_center)
}

We now visualize the clusters and their sentiments using t-SNE and the plot_tweet_sentiment_timeseries.R function. The first sentiment time-series plot covers the entire sample, while the following ones correspond to the clusters in ascending order (having run into this inconvenience, I plan to implement custom title functionality soon).

###############################################################################
# Run T-SNE on all the tweets and then plot sentiment time series for clusters
###############################################################################

set.seed(700)
tsne <- Rtsne(tweet.vectors.matrix, dims=2, perplexity=25, max_iter=750, check_duplicates=FALSE)
tsne.plot <- cbind(tsne$Y, tweet.vectors.df)
colnames(tsne.plot)[1:2] <- c("X", "Y")
tsne.plot$full_text <- sapply(tsne.plot$full_text, function(t) paste(strwrap(t, width=60), collapse="<br>"))
tsne.plot$cluster.label <- sapply(tsne.plot$cluster, function(c) clusters[[c]]$label)

cluster.sentiment.plots <- list()

#Master high level plot
fig.master <- plot_ly(tsne.plot, x=~X, y=~Y, 
               text=~paste("Cluster:", cluster,"<br>Text:", full_text), 
               color=~cluster.label, type="scatter", mode="markers")
fig.master <- fig.master %>% layout(title=paste("Master Plot:", master.label, "(high level clusters)"), 
                        yaxis=list(zeroline=FALSE), 
                        xaxis=list(zeroline=FALSE))
fig.master <- fig.master %>% toWebGL()
fig.master
#Master level tweet sentiment by day plot for the entire sample
fig.master.sentiment <- plot_tweet_sentiment_timeseries(tweet.vectors.df, group.by = "week", plot.ma = TRUE) 

#Cluster sentiment plots
for (i in 1:k) {
  print(paste("Plotting cluster", i, "sentiment time series..."))
  fig <- plot_tweet_sentiment_timeseries(tweet.vectors.df[tsne.plot$cluster == i,], group.by = "week", plot.ma = TRUE) 
  cluster.sentiment.plots[[i]] <- fig 
}
## [1] "Plotting cluster 1 sentiment time series..."


word.freq.list <- htmltools::tagList()
for (i in 1:k) {
  # Print cluster word frequencies
  if (isTRUE(show_word_freqs)) {
    word.freq.list[[i]] <- htmltools::HTML(kable(clusters[[i]]$word_freqs[1:10,], caption=paste("Cluster", i, "Top 10 Words")) %>% kable_styling())
  }
}
word.freq.list
Cluster 1 Top 10 Words: corona (475), mask (465), wear (124), virus (97), wearing (83), people (80), face (61), masks (32), social (28), home (23)
Cluster 2 Top 10 Words: virus (744), mask (725), wear (327), people (208), wearing (176), stay (97), home (82), covid19 (79), stop (79), spread (78)
Cluster 3 Top 10 Words: mask (573), covid (246), wear (243), covid19 (211), people (117), wearing (117), masks (83), patients (82), hospital (71), work (54)
Cluster 4 Top 10 Words: mask (631), coronavirus (305), covid19 (255), face (170), wearing (143), wear (118), cases (66), mandate (61), public (55), masks (54)
Cluster 5 Top 10 Words: mask (473), covid19 (356), wear (306), face (208), stay (146), hands (143), social (139), home (103), spread (103), wash (102)
Cluster 6 Top 10 Words: mask (798), covid19 (424), face (297), coronavirus (289), masks (138), facemask (91), covid (73), make (50), wear (48), virus (45)
Cluster 7 Top 10 Words: covid (551), mask (541), wear (168), wearing (98), people (73), 19 (44), face (33), masks (29), put (24), damn (21)
Cluster 8 Top 10 Words: n95 (424), masks (301), covid19 (175), mask (163), coronavirus (100), virus (75), covid (52), ppe (46), pandemic (42), surgical (38)
Cluster 9 Top 10 Words: covid19 (315), mask (180), wearamask (178), coronavirus (109), covid (74), people (69), cases (55), 2 (41), wear (39), 1 (36)
Cluster 10 Top 10 Words: mask (1116), virus (821), masks (334), wearing (330), wear (302), people (226), spread (152), protect (149), stop (118), face (107)
Cluster 11 Top 10 Words: mask (1083), wear (544), covid19 (434), wearing (404), people (236), covid (199), masks (197), coronavirus (160), face (152), public (76)
Cluster 12 Top 10 Words: mask (722), pandemic (686), wear (287), wearing (205), people (189), face (76), covid19 (74), masks (66), public (50), coronavirus (49)
Cluster 13 Top 10 Words: mask (510), trump (426), wear (189), covid19 (183), coronavirus (159), wearing (122), virus (102), pandemic (74), covid (64), pence (62)
Cluster 14 Top 10 Words: mask (655), virus (462), wear (239), people (184), wearing (168), pandemic (96), covid19 (95), cdc (82), masks (82), trump (76)
Cluster 15 Top 10 Words: mask (527), quarantine (161), wearing (120), covid19 (115), wear (106), virus (78), coronavirus (67), face (64), covid (55), people (55)
Cluster 16 Top 10 Words: mask (279), lockdown (257), people (79), wear (70), wearing (66), covid19 (37), social (36), face (30), distancing (28), masks (25)
Cluster 17 Top 10 Words: mask (578), covid19 (292), wear (197), covid (171), wearing (141), people (124), coronavirus (70), realdonaldtrump (57), trump (55), social (54)

Here we see very different trends in count, sentiment, and divisiveness across the different clusters.

Thus, we observe that although the distribution of sentiment across all tweets over time does not seem to give much useful information, the clusters returned by k-means not only possess very distinctive sentiment trends but can also be interpreted to arrive at claims about the public’s sentiment towards mask usage, social distancing, mask mandates, etc.

Areas for Improvement